1.      INTRODUCTION

Clustering  is  the  intelligent  partitioning  of  a  collection  of 
entities.  Specifically,  it  is  the  process  of  dividing  entities  (objects, 
observations,  measurements,  data,  etc.)  into  categories  that  are  meaningful 
or  useful  for  some  purpose.  It  is  one  of  the  fundamental  operations 
people  use  to  simplify  descriptions  of  their  environment,  and  by  that,  to 
improve  the  efficiency  of  their  decision  making.  Appropriate  clustering 
reveals  the  underlying  structure  of  the  given  set  of  objects,  and  hence 
clustering  can  be  viewed  as  a  form  of  knowledge  acquisition. 

Clustering  problems  pervade  many  fields,  particularly  experimental 
sciences  such  as  biology,  chemistry,  geology,  medicine,  etc.  Intelligent 
partitioning  of  objects  can  also  be  an  important  capability  of  autonomous 
or semi-autonomous robots designed for exploration of special environments
(e.g., the bottom of an ocean or the surface of a planet). Consequently,
understanding the nature of clustering is not only of scientific interest,
but  also  of  significant  practical  importance. 

A  conventional  view  of  clustering  is  that  it  is  a  process  of 
partitioning  objects  into  groups  such  that  the  degree  of  similarity  (or 
"natural  association")  is  high  among  objects  of  the  same  group,  and  low 
among  the  objects  of  different  groups.  The  notion  of  the  degree  of 
similarity  is  therefore  fundamental  to  this  viewpoint.  A  great  variety  of 
different  similarity  measures  have  been  developed  and  used  in  various 
clustering  techniques.  Frequently  a  reciprocal  of  a  distance  measure  is 
used  as  a  similarity  function.  The  distance  measure  for  such  purposes, 
however,  does  not  have  to  satisfy  all  the  postulates  of  a  distance 
function  (specifically,  the  triangle  inequality).  A  comprehensive  review 
of  various  distance  and  similarity  measures  is  provided  in  Diday  and  Simon 
[1]  and  Anderberg  [2].  Backer  [3]  describes  a  fuzzy  similarity  measure 
based  on  the  theory  of  fuzzy  sets. 

To  determine  the  similarity  of  objects,  a  measure  of  similarity  is 
applied  to  symbolic  descriptions  of  objects  (data  points).  Such 
descriptions  are  typically  vectors,  whose  components  represent  scores  on 
selected  qualitative  or  quantitative  variables  used  to  describe  objects. 
The  underlying  assumption  is  that  if  the  similarity  function  has  high  value 
for  the  given  descriptions,  then  the  objects  represented  by  the 
descriptions  are  similar.  The  similarity  relationship  between  any  two 
objects  in  the  population  to  be  clustered  is  thus  reduced  to  a  single 
number: the value of the similarity function applied to symbolic
descriptions of objects.

Conventional  measures  of  distance  are  "context-free,"  i.e.,  the 
distance  between  any  two  data  points  A  and  B  is  a  function  of  these  points 
only,  and  does  not  depend  on  the  relationship  of  these  points  to  other  data 
points: 

Similarity(A,B)  =  f(A,B)  (1) 

For  example,  for  any  conventional  distance  measure,  the  distance 
between points A and B is the same as between B and C (Fig. 1).


Fig. 1. An illustration of the context-free distance.

Recently  some  authors  have  been  introducing  "context-sensitive" 
measures  of  similarity: 

Similarity(A,B)  =  f(A,B,E)  (2) 

where  the  similarity  between  A  and  B  depends  not  only  on  A  and  B,  but  also 
on  the  relationship  of  A  and  B  to  other  data  points,  represented  in  (2)  by 
E. 

For  example,  Gowda  and  Krishna  [4]  defined  the  so-called  "mutual 
neighborhood"  distance  measure.  If  point  A  is  the  nth  closest  point  to  B 
and  B  is  the  mth  closest  point  to  A,  then  the  mutual  neighborhood  distance 
between A and B is n+m. These authors have demonstrated that a method using
such  a  distance  measure  can  solve  some  clustering  problems  which  methods 
based  on  the  "context-free"  distance  cannot. 
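
The mutual neighborhood distance described above can be illustrated with a minimal sketch. The function names, the use of Euclidean distance for ranking neighbors, and the tie-breaking by list position are assumptions made here for illustration; they are not taken from Gowda and Krishna [4].

```python
from math import dist


def neighbor_rank(points, i, j):
    """Rank of point j among the neighbors of point i (1 = closest).

    Assumption: neighbors are ranked by Euclidean distance, and ties are
    broken by position in the list (Python's sort is stable).
    """
    others = [k for k in range(len(points)) if k != i]
    others.sort(key=lambda k: dist(points[i], points[k]))
    return others.index(j) + 1


def mutual_neighborhood_distance(points, i, j):
    """n + m, where j is the n-th neighbor of i and i is the m-th neighbor of j."""
    return neighbor_rank(points, i, j) + neighbor_rank(points, j, i)


# Four points on a line: one isolated point and a tight group of three.
points = [(0.0, 0.0), (1.0, 0.0), (1.5, 0.0), (1.8, 0.0)]
d01 = mutual_neighborhood_distance(points, 0, 1)
d23 = mutual_neighborhood_distance(points, 2, 3)
```

Here the value for points 0 and 1 depends on where the remaining points lie, which is what makes the measure context-sensitive in the sense of (2).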

Both  previous  clustering  approaches  cluster  data  points  only  on  the 
basis  of  knowledge  of  the  individual  data  points.  Therefore  such  methods 
are  fundamentally  unable  to  capture  the  "Gestalt  property"  of  objects, 
i.e., a property which is characteristic of certain configurations of
points  considered  as  a  whole,  and  not  as  a  collection  of  independent 
points.  In  order  to  detect  such  properties,  the  system  must  know  not 
only  the  data  points,  but  also  certain  "concepts".  To  illustrate  this 
point,  let  us  consider  a  problem  of  clustering  data  points  in  Fig.  2. 

A  person  considering  the  problem  in  Fig.  2  would  typically  describe  it 
as  "a  circle  on  top  of  a  rectangle." 

Fig. 2. An illustration of conceptual clustering.


Thus, the points A and B, although very close, are placed in separate
clusters. Here, the human solution involves partitioning the data points into
groups not on the basis of pairwise distance between points, but on the
basis  of  "concept  membership."  That  means  that  the  points  are  placed  in  the 
same  cluster  if  together  they  represent  the  same  concept.  In  our  example, 
the  concepts  are  a  circle  and  a  rectangle. 

The  approach  to  clustering  which  clusters  objects  into  groups 
representing  a  priori  defined  conceptual  entities  is  called  "conceptual 
clustering."  A  link  between  conceptual  clustering  and  distance-based 
clustering  methods  can  be  established  by  stating  that  in  conceptual 
clustering  the  similarity  between  the  data  points  is  a  function  of  these 
points,  context  E,  and  a  set  of  predefined  concepts  C: 

Similarity(A,B)  =  f(A,B,E,C)  (3)

The  approach  has  been  introduced  by  Michalski  [5].  It  evolved  from 
earlier  work  by  the  author  and  his  collaborators  on  the  problem  of 
generating  "uniclass  covers."  Such  covers  are  disjunctive  descriptions  of 
a  class  of  objects  learned  from  only  positive  examples  of  the  class.  Stepp 
[6]  describes  a  computer  program  and  various  experimental  results  on 
determining  uniclass  covers.  His  work  is  concerned  with  what  can  be  called 
"free"* conceptual clustering.

The  idea  that  the  similarity  measures  of  the  type  (1)  or  (2)  (the 
"concept-free"  measures)  may  be  inadequate  for  some  clustering  problems  is 
not  new.  In  the  past,  several  authors  noticed  this  problem  and  proposed 
various solutions. For example, Watanabe [7,8] proposed the concept of
"cohesion" to measure the "degree of clusterness" of points, which
utilizes the entropy measure. Using this concept he was able to resolve
the "three girls in the dormitory" paradox, which cannot be solved by
"concept-free" methods. Other measures of "cohesiveness" of objects were
proposed on the basis of graph-theoretic considerations, e.g., Matula [9],
Augustson and Minker [10], Zahn [11], Cheng [12].


*In "free" clustering the number of clusters is not predefined, as opposed
to "constrained" clustering where the number of clusters is assumed a
priori.

This paper presents a theoretical basis and an algorithm for
conceptual clustering, where conceptual entities are conjunctive statements
in the variable-valued logic calculus VL1 [13] (which is a typed many-valued
logic extension of propositional calculus). These statements, called VL1
complexes, are logical products of relational statements involving discrete
variables with an arbitrary number of values (Definitions 2 and 3 in the next
section). Complexes have a simple linguistic interpretation and are able
to express concisely a large class of relationships among discrete
variables. The algorithm combines the methodology of optimization of
variable-valued logic expressions [14] with the dynamic clustering method
[1]. Its theoretical foundation is a special property of complexes
formulated as the Sufficiency Principle (section 3).


2.      COMPLEXES  AS  CONCEPTUAL  ENTITIES  FOR  CLUSTERING:  BASIC  DEFINITIONS 

Let $x_1, x_2, \ldots, x_n$ denote discrete variables which are selected to
describe objects in the population to be clustered. For each variable a
value set or domain is defined, which contains all possible values this
variable can take for any object in the population. We shall assume that
the value sets of variables $x_i$, $i = 1, 2, \ldots, n$, are finite, and therefore can
be represented as:

    $D_i = \{0, 1, \ldots, \mathscr{d}_i\}, \quad i = 1, 2, \ldots, n$    (4)

In general, the value sets may differ not only with respect to their size,
but also with respect to the structure relating their elements (reflecting
the scale of measurement). In this paper we will restrict ourselves only to
the case of nominal or linear variables (i.e., variables with unordered or
linearly ordered domains, respectively). A sequence of values of variables
$x_1, x_2, \ldots, x_n$ is called an event:

    $e = (r_1, r_2, \ldots, r_n)$    (5)

where $r_i \in D_i$, $i = 1, 2, \ldots, n$.

The set of all possible events, $\mathcal{E}$, is called the event space:

    $\mathcal{E} = \{e_i\}_{i=1}^{d}$    (6)

where $d = d_1 \cdot d_2 \cdots d_n$ (the size of the event space) and $d_i = \mathscr{d}_i + 1$.


Definition 1. Given two events $e_1$, $e_2$ in $\mathcal{E}$, the syntactic distance
$\delta(e_1, e_2)$ between $e_1$ and $e_2$ is defined as the number of variables which have
different values in $e_1$ and $e_2$.
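
Definition 1 amounts to counting the positions at which two events differ. A minimal sketch follows, assuming events are represented as equal-length tuples of values (a representation chosen here for illustration):

```python
def syntactic_distance(e1, e2):
    """Number of variables that take different values in events e1 and e2."""
    if len(e1) != len(e2):
        raise ValueError("events must be described by the same variables")
    return sum(1 for a, b in zip(e1, e2) if a != b)


# The two events differ in the second and the fourth variable.
delta = syntactic_distance((2, 7, 0, 1), (2, 5, 0, 3))
```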


Definition 2. A relational expression

    $[x_i \,\#\, R_i]$    (7)

where $R_i$, called the reference set, is one or more elements from the domain
$D_i$, and # stands for one of the relational operators $=$, $\neq$, $\geq$, $\leq$, is called a
VL1 selector* or, briefly, a selector.


*VL1 stands for the variable-valued logic system VL1 [13], which uses such
selectors.
Here are a few examples of selectors, in which variables and their
values are represented by linguistic terms:

    [height = tall]
    [color = blue, red]    (read: color is blue or red)
    [length > 2]
    [size $\neq$ medium]
    [weight = 2..5]

The operator .. in the last selector denotes the range of values from
2 to 5, inclusively. It is used when the domain of the variable is a
linearly ordered set. A selector $[x_i \,\#\, R_i]$ is said to be satisfied by an
event $e = (x_1, x_2, \ldots, x_n)$ if the value of $x_i$ in $e$ is in relation # with
any element of $R_i$.

Definition 3. A logical product of selectors is called a VL1 term:

    $\bigwedge_{i \in I} [x_i \,\#\, R_i]$    (8)

where $I \subseteq \{1, 2, \ldots, n\}$ and $R_i \subseteq D_i$. A set of events which satisfy a VL1
term is called a VL1 complex or, briefly, a complex.

Thus a VL1 term is a formal representation of a complex. Since these
two notions have a one-to-one correspondence, we will use them
interchangeably, unless it leads to confusion. Therefore, if a set-theoretic
notation is applied to a term, it means that the operation is
applied to the corresponding complex (i.e., the set of events satisfying the
term). A complex (VL1 term) $\alpha$ is said to cover an event $e$ if the values
of variables in $e$ satisfy the relational statements (selectors) in the
complex (term).

For example, the event $e = (2, 7, 0, 1, 5, 4, 6)$ satisfies the complex
$[x_1 = 2,3][x_3 \leq 3][x_5 = 3..8]$.
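
The example above can be checked mechanically. The sketch below is one possible encoding, not the paper's: a selector is turned into the set of domain values that satisfy it, and a complex is a dictionary mapping a 1-based variable index to that set. The domain `range(10)` for every variable and the reading of `<=` and `>=` against the largest and smallest reference value are assumptions made for illustration.

```python
def selector(op, ref, domain):
    """Set of domain values satisfying the selector [x op ref]."""
    ref = set(ref)
    if op == "=":
        return {v for v in domain if v in ref}
    if op == "!=":
        return {v for v in domain if v not in ref}
    if op == "<=":
        return {v for v in domain if v <= max(ref)}
    if op == ">=":
        return {v for v in domain if v >= min(ref)}
    raise ValueError(f"unknown operator: {op}")


def covers(complex_, event):
    """True if the event satisfies every selector of the complex."""
    return all(event[i - 1] in allowed for i, allowed in complex_.items())


domain = range(10)  # assumed domain {0, ..., 9} for every variable
alpha = {
    1: selector("=", [2, 3], domain),        # [x1 = 2,3]
    3: selector("<=", [3], domain),          # [x3 <= 3]
    5: selector("=", range(3, 9), domain),   # [x5 = 3..8]
}
e = (2, 7, 0, 1, 5, 4, 6)
is_covered = covers(alpha, e)
```

Variables that do not appear in the complex (here x2, x4, x6, x7) are unconstrained, which matches the convention that a missing selector admits the whole domain.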

Let $E$ be a set of events in $\mathcal{E}$, which are data points to be clustered.
The events in $E$ are called data events (or observed events) and events in
$\mathcal{E} \setminus E$ (i.e., events in $\mathcal{E}$ which are not data events) are called empty events
(or unobserved events).

Let $\alpha$ be a complex which covers some data events and some empty
events.

Definition 4. The number of empty events covered by $\alpha$ is called the
sparseness of $\alpha$ and denoted by $s(\alpha)$.

Let $p(\alpha)$ denote the number of data events covered by $\alpha$, and $t(\alpha)$
denote the total number of events covered by $\alpha$. We have then
$t(\alpha) = p(\alpha) + s(\alpha)$. The total number of events satisfying the complex
$\alpha = \bigwedge_{i \in I} [x_i \,\#\, R_i]$ is:

    $t(\alpha) = \prod_{i \in I} c(R_i) \cdot \prod_{i \notin I} d_i$    (9)

where

    $I \subseteq \{1, 2, \ldots, n\}$,
    $c(R_i)$ is the cardinality of $R_i$,
    $d_i$ is the cardinality of the value set of variable $x_i$.
Definition 5. The degree of generality $g(\alpha)$ of complex $\alpha$ is defined as:

    $g(\alpha) = \log \frac{t(\alpha)}{p(\alpha)} = \log\left(1 + \frac{s(\alpha)}{p(\alpha)}\right)$    (10)

The value $t(\alpha)/p(\alpha)$ specifies how many events are in the complex per one data
event. Thus, the degree of generality $g(\alpha)$ specifies the uncertainty of the
location of the data points in the complex. The greater the degree of
generality of a complex, the greater is the uncertainty. If $g = 0$, then
all the events in the complex are data events. We can see from (10) that
for a fixed $p(\alpha)$ the degree of generality is a monotonically growing
function of sparseness.
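
The quantities t, p, s and g can be computed directly from (9) and (10). The following is a minimal sketch under stated assumptions: a complex is a dictionary from a 1-based variable index to its reference set (selectors with "="), `domain_sizes` lists the d_i, and the logarithm is taken as natural, since the base is not fixed in the text.

```python
from math import log, prod


def covers(complex_, event):
    """True if the event satisfies every selector of the complex."""
    return all(event[i - 1] in ref for i, ref in complex_.items())


def total_events(complex_, domain_sizes):
    """t(alpha) per (9): c(R_i) for constrained variables, d_i for the others."""
    return prod(len(complex_[i]) if i in complex_ else d
                for i, d in enumerate(domain_sizes, start=1))


def data_events(complex_, data):
    """p(alpha): number of data events covered by the complex."""
    return sum(1 for e in data if covers(complex_, e))


def sparseness(complex_, data, domain_sizes):
    """s(alpha) = t(alpha) - p(alpha): number of empty events covered."""
    return total_events(complex_, domain_sizes) - data_events(complex_, data)


def generality(complex_, data, domain_sizes):
    """g(alpha) per (10); assumes the complex covers at least one data event."""
    return log(total_events(complex_, domain_sizes) / data_events(complex_, data))


domain_sizes = (3, 3, 2)                    # d_1, d_2, d_3
alpha = {1: {0, 1}, 3: {0}}                 # [x1 = 0,1][x3 = 0]
data = [(0, 0, 0), (1, 2, 0), (2, 1, 1)]    # the last event is not covered
t = total_events(alpha, domain_sizes)
s = sparseness(alpha, data, domain_sizes)
g = generality(alpha, data, domain_sizes)
```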


Let $L$ be a set of complexes (or events), and $R_i$ be the set of all the
distinct values which variable $x_i$ takes in these complexes (or events).

Definition 6. The operation which transforms $L$ into the complex
$\bigwedge_{i=1}^{n} [x_i = R_i]$ is called reference union or refunion. The resulting
complex is called the minimal covering complex or mc-complex for $L$ and
denoted RU($L$).

If any $R_i = D_i$, then the corresponding selector is removed from the
complex. The refunion is thus a transformation which transforms a set of
complexes (or events) into the minimal covering complex.
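
For a set of events, the refunion can be sketched in a few lines. The representation is an assumption made here: events are tuples, domains are $\{0, \ldots, d_i - 1\}$ given by `domain_sizes`, and the result is a dictionary from a 1-based variable index to its reference set. The sketch handles events only, not the more general case of a set of complexes.

```python
def refunion(events, domain_sizes):
    """Minimal covering complex RU(L) of a non-empty list of events."""
    n = len(domain_sizes)
    values = {i: set() for i in range(1, n + 1)}
    for e in events:
        for i, v in enumerate(e, start=1):
            values[i].add(v)  # R_i collects every value x_i takes in L
    # A selector whose reference set equals the whole domain is removed.
    return {i: r for i, r in values.items() if len(r) < domain_sizes[i - 1]}


domain_sizes = (2, 3, 2)
events = [(0, 1, 0), (1, 1, 0), (0, 2, 0)]
mc = refunion(events, domain_sizes)
```

In this example x1 takes both values of its domain, so its selector is dropped and the mc-complex reads [x2 = 1,2][x3 = 0].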

Theorem 1. The mc-complex of an event set has the minimum sparseness among
all complexes covering this set.

Proof: Let $\alpha$ be the mc-complex for an event set $E$:

    $\alpha = \mathrm{RU}(E) = \bigwedge_{i=1}^{n} [x_i = R_i]$    (11)

where $R_i \subseteq D_i$ (the domain of $x_i$). Suppose that $\beta = \bigwedge_{i=1}^{n} [x_i = P_i]$ is a
complex which covers $E$ and has a smaller sparseness than $\alpha$. If this is
true, then there must exist $P_i$ such that $P_i \subset R_i$. But $R_i$, according to
Definition 6, consists of exactly the values that $x_i$ takes in events in $E$.
Therefore, if $P_i \subset R_i$, then complex $\beta$ could not possibly cover all events
in $E$, which is a contradiction. •


Let $E$ be the data events which are covered by a complex $\alpha$.

Definition 7. The set $E$ is called the core of $\alpha$, and the complex $\alpha^* = \mathrm{RU}(E)$
is called the trimmed $\alpha$.

From Theorem 1 we have $\alpha^* \subseteq \alpha$.

Theorem 2. If $E_1$ and $E_2$ are two disjoint event sets then:

    $s(\mathrm{RU}(E_1)) + s(\mathrm{RU}(E_2)) < s(\mathrm{RU}(E_1 \cup E_2))$    (12)

Proof: According to Theorem 1, RU($E_1$) and RU($E_2$) have the smallest possible
sparseness among all complexes covering $E_1$ and $E_2$, respectively. Since $E_1$
and $E_2$ are disjoint, then (12) must hold. •

The property expressed by Theorem 2 has an analogy in statistical
clustering, where with the increasing number of clusters the "fit" between
each cluster and the probability distribution "fitted" to the cluster also
increases.
Theorem 3. Let $\alpha_1$ and $\alpha_2$ be two intersecting complexes, whose union covers
an event set $E$. Let $E_1$ ($E_2$) denote the set of events in $\alpha_1$ ($\alpha_2$) which are
covered only by this complex (the relative core of the complex). Let $\alpha'_1$
and $\alpha'_2$ be any two disjoint complexes covering the same event set $E$. If
RU($E_1$) and RU($E_2$) are disjoint complexes, then:

    $s(\mathrm{RU}(E_1)) + s(\mathrm{RU}(E_2)) < s(\alpha'_1) + s(\alpha'_2)$    (13)

Proof: The theorem is an immediate consequence of Theorem 2 and the premise
that $\alpha'_1$ and $\alpha'_2$ are disjoint complexes. •


We  will  next  introduce  two  basic  concepts  for  the  conceptual 
clustering  algorithm  presented  in  section  6.  They  are  the  star  of  an  event 
against  an  event  set  and  a  cover  of  an  event  set  against  another  event  set. 

Let $F$ be a proper subset of the event space $\mathcal{E}$, and $e$ an event outside
of $F$, i.e., $e \notin F$.

Definition 8. The star $G(e|F)$ of $e$ against $F$ is the set of all complexes,
maximal under inclusion, covering the event $e$ and not covering any event
in $F$. (A complex $\alpha$ is maximal under inclusion with respect to property P
if there does not exist a complex $\alpha^*$ with property P such that $\alpha \subset \alpha^*$.)

Let $E_1$ and $E_2$ be two disjoint event sets, $E_1 \cap E_2 = \emptyset$.

Definition 9. A cover COV($E_1|E_2$) of $E_1$ against $E_2$ is any set of complexes,
$\{\alpha_j\}_{j \in J}$, such that for each event $e \in E_1$ there is a complex $\alpha_j$, $j \in J$,
covering it, and none of the complexes $\alpha_j$ cover any event in $E_2$. Thus we
have:

    $E_1 \subseteq \bigcup_{j \in J} \alpha_j \subseteq \mathcal{E} \setminus E_2$    (14)

A cover in which all complexes are pairwise disjoint sets is called a
disjoint cover. If set $E_2$ is empty, then the cover COV($E_1|E_2$) = COV($E_1|\emptyset$)
is simply denoted as COV($E_1$).

Definition 10. The sparseness (the degree of generality) of a cover is
defined  as  the  sum  of  the  sparsenesses  (the  degrees  of  generality)  of 
complexes  in  the  cover. 

3.      SUFFICIENCY  OF  COMPLEXES  AS  CLUSTER  REPRESENTATIONS

First, we will observe the following property of complexes:

Theorem 4. For any given event space $\mathcal{E}$ and integer $k \leq d_1 \cdot d_2 \cdots d_n$ (where $d_i$
is the cardinality of the value set of variable $x_i$), there exist $k$ pairwise
disjoint complexes $\alpha_1, \alpha_2, \ldots, \alpha_k$ which completely fill up the space $\mathcal{E}$,
i.e.,

    $\bigcup_{j=1}^{k} \alpha_j = \mathcal{E}$    (15)

Proof: The theorem is equivalent to saying that any event space can be
partitioned into an arbitrary number of complexes (but, of course, not
larger than the cardinality of $\mathcal{E}$). To see this, take any subset of
variables $x_i$, $i \in I$, such that the arithmetic product of the corresponding
$d_i$'s is greater than or equal to $k$. Let $R_j$, $j = 1, 2, \ldots, k'$, denote all
possible sequences of values of variables $x_i$, $i \in I$.
Construct complexes:

    $\alpha_j = \bigwedge_{i \in I} [x_i = r_{ij}]$    (16)

where $r_{ij}$, $i \in I$, $j = 1, 2, \ldots$, denotes the value of variable $x_i$ in the
sequence $R_j$. Obviously, the complexes $\alpha_j$ are pairwise disjoint and fill up
the space $\mathcal{E}$. If $k' > k$, then $k' - k$ complexes are joined with the
remaining ones into single complexes, according to the formula:

    $\beta[x_i = a] \vee \beta[x_i = b] = \beta[x_i = a,b]$    (17)

where $\beta$ denotes a conjunction of selectors involving variables other than
$x_i$. This is always possible, because for any $x_i$, $i \in I$, there are $d_i$
complexes $\alpha_j$ which differ only in the value of $x_i$. •

From the point of view of clustering, a more interesting question is whether
for any given event set $E$ in the space $\mathcal{E}$, there always exist $k$ pairwise
disjoint complexes, for an arbitrary $k \leq c(E)$, such that they not only
fill up the space $\mathcal{E}$, but also partition the set $E$ into $k$ non-empty subsets.
A positive answer to this question would imply that any given event set can
be partitioned into an a priori assumed number of subsets, each covered by
a simple complex, disjoint from other complexes. The answer is indeed
positive. In fact, even a stronger property holds, as stated by the
following theorem.

Theorem 5. (The Sufficiency Principle)

For an event space $\mathcal{E}$ and any data event set $E = \{e_1, e_2, \ldots, e_k\}$,
$E \subseteq \mathcal{E}$, there exists at least one set of $k$ pairwise disjoint complexes
$\alpha_1, \alpha_2, \ldots, \alpha_k$, such that each complex contains one data event:

    $e_j \in \alpha_j, \quad j = 1, 2, \ldots, k$    (18)

and the union of complexes fills up the space $\mathcal{E}$:

    $\bigcup_{j=1}^{k} \alpha_j = \mathcal{E}$    (19)

Proof:

The basic idea of the proof is to show that for any
$E = \{e_1, e_2, \ldots, e_k\}$, $E \subseteq \mathcal{E}$, it is always possible to construct a tree, in
which nodes are assigned the variables $x_i$, $i \in \{1, 2, \ldots, n\}$, branches of node
$x_i$ are assigned elements of a partition of $D_i$ (the value set of $x_i$), and
the leaves represent complexes $\alpha_j$, such that each complex covers a single
event $e_j$, and the union of complexes fills up the space $\mathcal{E}$.

Suppose $e_j = (x_{1j}, x_{2j}, \ldots, x_{nj})$, $j = 1, 2, \ldots, k$, and $x_{ij} \in D_i$.
Take any variable, say $x_p$, which has different values for events in $E$.
Suppose these values are $a_1, a_2, \ldots, a_z$. Partition the value set $D_p$ of
$x_p$ into subsets $\{a_1\}, \{a_2\}, \ldots, \{a_{z-1}\}, A_z$, where $a_z \in A_z$ and $A_z$ is the set
$D_p \setminus \{a_1, a_2, a_3, \ldots, a_{z-1}\}$. It is obvious that complexes
$[x_p = a_1], [x_p = a_2], \ldots, [x_p = A_z]$ partition both the event set $E$ and
the event space $\mathcal{E}$ into $z$ non-empty subsets. Suppose these complexes
partition $E$ into $E_{a_1}, E_{a_2}, \ldots, E_{A_z}$ and $\mathcal{E}$ into $\mathcal{E}_{a_1}, \mathcal{E}_{a_2}, \ldots, \mathcal{E}_{A_z}$, where
$E_{a_i} \subseteq \mathcal{E}_{a_i}$.

Variable $x_p$ is assigned to the root of a tree. Branches from the root
are assigned values $a_1, a_2, \ldots, A_z$. Leaves of this tree correspond to
complexes $[x_p = a_1], [x_p = a_2], \ldots, [x_p = A_z]$, covering event sets
$E_{a_1}, E_{a_2}, \ldots, E_{A_z}$, respectively (Fig. 3).

Fig. 3. Constructing a tree for the proof of the sufficiency principle.

For every one of the above event sets which has more than one element
repeat the above process with the following modification. Suppose $E_{a_1}$ has
more than one element and $x_r$ takes values $b_1, b_2, \ldots, b_y$ for events in
$E_{a_1}$. Assign $x_r$ to the root of a new tree, and attach the tree to the leaf
corresponding to $E_{a_1}$ (i.e., to the leaf marked by $[x_p = a_1]$ in Fig. 3).
Assign the branches emanating from this root values $b_1, b_2, \ldots, B_y$, where
$B_y = D_r \setminus \{b_1, b_2, \ldots, b_{y-1}\}$. It is obvious that complexes:

    $[x_p = a_1][x_r = b_1], \; [x_p = a_1][x_r = b_2], \; \ldots, \; [x_p = a_1][x_r = B_y]$    (20)

partition both the set $E_{a_1}$ and the set $\mathcal{E}_{a_1}$ into $y$ disjoint subsets.


This process is continued until the leaves of the obtained tree correspond
to complexes, each of which covers only one event from $E$. Because every
step of this process simultaneously partitions events in $E$ and in $\mathcal{E}$, the
union of the obtained complexes covers $E$ and fills up the whole space $\mathcal{E}$.
Thus, these complexes constitute the desired set $\{\alpha_1, \alpha_2, \ldots, \alpha_k\}$. •

The above theorem asserts that the space of all complexes is
sufficient to be a space of cluster representations, because any event set
can be clustered into an arbitrary number of complexes. The theorem is
used as the theoretical basis for the clustering algorithm described in
section 6.
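
The tree construction used in the proof of the Sufficiency Principle can be written as a short recursive procedure. The sketch below is an illustration under assumptions that are not in the text: events are distinct tuples, domains are $\{0, \ldots, d_i - 1\}$, the splitting variable is simply the first one on which the events differ, and the last branch receives all domain values not used by the other branches. Function and variable names are hypothetical.

```python
from itertools import product


def covers(complex_, event):
    """True if the event satisfies every selector of the complex."""
    return all(event[i - 1] in ref for i, ref in complex_.items())


def partition_into_complexes(events, domain_sizes, prefix=None):
    """Return (complex, event) pairs: disjoint complexes, one data event each."""
    prefix = dict(prefix or {})
    if len(events) == 1:
        return [(prefix, events[0])]
    # Pick a variable x_p that takes different values among the events.
    p = next(i for i in range(len(domain_sizes))
             if len({e[i] for e in events}) > 1)
    values = sorted({e[p] for e in events})          # a_1, ..., a_z
    # A_z = D_p minus {a_1, ..., a_(z-1)}; it contains a_z and unused values.
    last_branch = set(range(domain_sizes[p])) - set(values[:-1])
    leaves = []
    for a in values:
        allowed = {a} if a != values[-1] else last_branch
        subset = [e for e in events if e[p] in allowed]
        leaves += partition_into_complexes(
            subset, domain_sizes, {**prefix, p + 1: allowed})
    return leaves


domain_sizes = (3, 2)
data = [(0, 0), (0, 1), (2, 1)]
leaves = partition_into_complexes(data, domain_sizes)
space = list(product(*(range(d) for d in domain_sizes)))
```

For this input the three resulting complexes are [x1 = 0][x2 = 0], [x1 = 0][x2 = 1] and [x1 = 1,2], which together cover each of the six events of the space exactly once.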

As  the  above  proof  indicates,  there  usually  will  be  many  covers  which 
constitute  a  k-partition  of  any  given  event  set.  Therefore,  a  question 
arises  as  to  which  cover  to  select  as  the  most  desirable.  In  order  to 
answer  this  question,  a  criterion  of  the  quality  of  a  cover  is  needed. 

4.      A  CRITERION  FOR  EVALUATING  QUALITY  OF  CLUSTERING

Let  E  be  the  set  of  data  points,  and  COV(E)  a  disjoint  cover  of  E. 
Such  a  cover  implies  a  partition  of  E  into  clusters,  each  cluster  being  the 
event  set  contained  in  one  complex.  The  sparseness  (or  the  degree  of 
generality)  of  the  cover  could  be  used  for  defining  a  criterion  of  quality 
of  a  partition.  However,  if  E  is  partitioned  into  individual  events, 
then, obviously, the sparseness (as well as the degree of generality)
will be zero. Consequently, this kind of criterion can be used only if the
number of clusters is assumed a priori, i.e., for a constrained clustering
problem. In this case the problem is to find a disjoint cover of $E$ with $k$
complexes, whose sparseness (or the degree of generality) is minimum. In
the case of a free clustering problem (i.e., when the number of clusters is
not assumed a priori), a criterion of quality of partitioning has to
involve, in addition to sparseness (or the degree of generality), also some
"cost" function dependent on the number of clusters, e.g., a measure of
complexity of a cover. In this paper we are concerned only with the
constrained clustering problem. Although it may seem otherwise, this is not
a serious limitation because interesting practical solutions of
clustering problems should not produce more than just a few clusters (this
is so because, when the number of clusters is large, humans prefer to
organize them into a hierarchy). Consequently, to obtain a general
solution, a constrained clustering algorithm should be repeated for
several different $k$, and the best obtained partition selected as the
general solution.

The  sparseness  (or  the  degree  of  generality)  may  not  be  sufficient  as 
the  sole  criterion  for  selecting  a  cover.  One  may  seek  a  cover  which 
exhibits  other  properties  than  minimum  sparseness.  In  order  to  use  several 
criteria  for  selecting  a  cover  simultaneously,  we  adopt  the  lexicographic 
cost  functional  defined  in  [14] . 

A lexicographic evaluation functional (LEF) is defined as a pair of
two lists:

    $A = \langle$a-list, $\tau$-list$\rangle$    (21)

where a-list = $(a_1, a_2, \ldots, a_\ell)$ is a list of attributes used
to evaluate a cover, and
$\tau$-list = $(\tau_1, \tau_2, \ldots, \tau_\ell)$ is a list of "tolerances" assigned to
the attributes $a_i$, respectively, $0 \leq \tau_i \leq 1$.
Let $V_j$, $j = 1, 2, \ldots$, denote all possible disjoint covers of the event
set $E$. Let $V$ denote one of the covers, and let $a_i(V_j)$ denote the value of
attribute $a_i$ for cover $V_j$. Cover $V$ is said to be optimal (minimal) under
functional $A$ if for every $j$:

    $A(V) \lessdot A(V_j)$    (22)

where

    $A(V) = (a_1(V), a_2(V), \ldots, a_\ell(V))$
    $A(V_j) = (a_1(V_j), a_2(V_j), \ldots, a_\ell(V_j))$, $j = 1, 2, \ldots,$

and $\lessdot$ is a relation, called the lexicographic order with tolerances, which
holds if:

       $a_1(V_j) - a_1(V) > T_1$
    or $|a_1(V_j) - a_1(V)| \leq T_1$ and $a_2(V_j) - a_2(V) > T_2$
    or $\ldots$
    or $\ldots$ and $a_\ell(V_j) - a_\ell(V) \geq 0$    (23)

where

    $T_i = \tau_i \cdot (a_{i\max} - a_{i\min})$, $i = 1, 2, \ldots, \ell - 1$,
    $a_{i\max} = \max_j \{a_i(V_j)\}$,
    $a_{i\min} = \min_j \{a_i(V_j)\}$.

Note that if $\tau$-list $= (0, 0, \ldots, 0)$ then $\lessdot$ denotes the lexicographic order
in the usual sense. In this case, $A$ can be specified just as $A = \langle$a-list$\rangle$.
To specify the functional $A$ one selects a set of attributes, puts
them in the desirable order in the a-list, and sets the values for
tolerances in the $\tau$-list.

Relation $\lessdot$ partitions all covers into equivalence classes and orders
the classes linearly, with the first class containing one or more optimal
covers, and the next classes containing consecutively less optimal covers.
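
The lexicographic order with tolerances can be sketched as a comparison of attribute tuples. The code below is one reading of (23), not a definitive implementation: an attribute decides the comparison when the two covers differ by more than its tolerance, differences within the tolerance pass the decision to the next attribute, and the last attribute is compared without tolerance. The non-strict comparisons, the function names and the sample attribute values are assumptions made for illustration.

```python
def absolute_tolerances(attribute_tuples, taus):
    """T_i = tau_i * (a_i_max - a_i_min) over all covers, for the given taus."""
    columns = list(zip(*attribute_tuples))
    return [tau * (max(col) - min(col)) for tau, col in zip(taus, columns)]


def lef_precedes(v, vj, tolerances):
    """True if cover v is not worse than cover vj under the LEF.

    v and vj are tuples (a_1, ..., a_l); tolerances holds T_1, ..., T_(l-1).
    """
    *head, last = zip(v, vj)
    for (a, aj), t in zip(head, tolerances):
        if aj - a > t:
            return True        # v is better by more than the tolerance
        if abs(aj - a) > t:
            return False       # v is worse by more than the tolerance
        # otherwise the covers are equivalent on this attribute
    a, aj = last
    return aj - a >= 0


def optimal_covers(attribute_tuples, taus):
    """Covers that precede every cover (including themselves)."""
    tol = absolute_tolerances(attribute_tuples, taus)
    return [v for v in attribute_tuples
            if all(lef_precedes(v, w, tol) for w in attribute_tuples)]


# Hypothetical (sparseness, intersection) values of three covers.
covers_attrs = [(10, 3), (11, 1), (20, 0)]
with_tolerance = optimal_covers(covers_attrs, taus=(0.2,))
strict = optimal_covers(covers_attrs, taus=(0.0,))
```

With a tolerance of 0.2 on the first attribute, the covers with sparseness 10 and 11 count as equivalent and the second attribute decides; with zero tolerance the order reduces to the usual lexicographic one.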

Below  are  a  few  criteria  which  may  be  used  to  assemble  an  a-list: 

•  Sparseness  (or  generality  g)  of  a  cover.  Minimizing  sparseness  will 
produce  complexes  which  "fit"  as  closely  as  possible  to  clusters  of  data 
events.  This  criterion  is  an  analog  to  the  criterion  of  minimizing  intra- 
distances  in  the  conventional  distance-based  clustering. 

•  Intersection, defined as the average degree of intersection (DI) between
any two complexes in the cover. The DI between two complexes is the total
number of selectors which remain in both complexes after removing every
pair of disjoint selectors (selectors whose reference sets do not
intersect). For example, the degree of intersection between complexes

    $[x_2 = 2,3][x_4 = 3,5,7][x_5 = 2..5]$
and
    $[x_1 = 3][x_2 = 1][x_4 = 5..12][x_5 = 1]$

is 3.


The  introduction  of  DI  as  a  criterion  for  clustering  comes  from  the 
observation,  that  people  tend  to  prefer  partitions  of  objects,  in  which 
clusters  differ  not  in  just  one,  but  in  many  characteristics.  This 
criterion  is  an  analog  to  the  criterion  of  maximizing  cluster  inter- 
distances  in  distance-based  clustering. 
•  Imbalance, defined as

    $\frac{1}{k} \sum_{i=1}^{k} \left| \frac{1}{k} \cdot c(E) - c(E \cap \alpha_i) \right|$    (24)

where $c(E)$ is the size of the event set, and $c(E \cap \alpha_i)$ is the number of
data events covered by complex $\alpha_i$ (the cardinality of the core of $\alpha_i$). The
imbalance measures the variability of cluster sizes.


•  Dimensionality, defined as the total number of different variables
involved  in  the  complexes  of  the  cover.  The  dimensionality  tells  us  how 
many  variables  are  used  to  describe  clusters,  and,  thus,  how  many  variables 
have  to  be  measured  to  classify  objects  into  these  clusters. 
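
The intersection, imbalance and dimensionality criteria above can be sketched in a few lines. This is only an illustration under assumptions that are not part of the paper: a complex is represented as a Python dict mapping a variable name to its reference set, an event is a dict mapping variable names to values, and the function names are hypothetical.

```python
from itertools import combinations

def degree_of_intersection(c1, c2):
    """Selectors left in both complexes after removing disjoint selector pairs."""
    disjoint = {v for v in c1.keys() & c2.keys() if not (c1[v] & c2[v])}
    return (len(c1) - len(disjoint)) + (len(c2) - len(disjoint))

def intersection_criterion(cover):
    """Average DI over all pairs of complexes in the cover."""
    pairs = list(combinations(cover, 2))
    return sum(degree_of_intersection(a, b) for a, b in pairs) / len(pairs)

def covers(cpx, event):
    """True if the event satisfies every selector of the complex."""
    return all(event[v] in vals for v, vals in cpx.items())

def imbalance(cover, events):
    """Mean absolute deviation of cluster sizes from c(E)/k."""
    k = len(cover)
    sizes = [sum(covers(c, e) for e in events) for c in cover]
    return sum(abs(len(events) / k - s) for s in sizes) / k

def dimensionality(cover):
    """Number of different variables used in the complexes of the cover."""
    return len(set().union(*(c.keys() for c in cover)))

# The two complexes from the DI example in the text.
a = {"x2": {2, 3}, "x4": {3, 5, 7}, "x5": set(range(2, 6))}
b = {"x1": {3}, "x2": {1}, "x4": set(range(5, 13)), "x5": {1}}
print(degree_of_intersection(a, b))  # 3
```

Counting selectors per variable name keeps the DI computation close to the verbal definition; a different representation (for example interval bounds for linear variables) would work equally well.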

5.   PROCEDURES STAR AND NID

Before  describing  an  algorithm  for  conceptual  clustering  (next 
section)  we  shall  first  describe  two  important  procedures  used  in  this 
algorithm:  STAR  and  NID.  Procedure  STAR  generates  the  star  (def.  8)  of  a 
data  event  against  a  set  of  other  data  events,  and  procedure  NID  transforms 
a  non-disjoint  cover,  whenever  possible,  into  a  disjoint  cover  with  the 
same  number  of  complexes. 

Procedure  STAR: 

This  procedure  is  based  on  the  algorithm  described  in  [14] . 


Let $e_0$ be an event and $\alpha$ a complex. The operation $e_0 \vdash \alpha$ (read: $e_0$
extended in $\alpha$) is defined:

$$e_0 \vdash \alpha = \begin{cases} \alpha, & \text{if } e_0 \in \alpha \\ \emptyset, & \text{otherwise} \end{cases} \qquad (25)$$

Let event $e_1 = (r_1, r_2, \ldots, r_n)$ and $e_0 \neq e_1$. The operation $e_0 \dashv e_1$
(read: $e_0$ extended against $e_1$) is defined:

$$e_0 \dashv e_1 = \bigcup_{i \in I} \left(e_0 \vdash [x_i \neq r_i]\right) \qquad (26)$$

Let $G^U(e|E)$ denote the union of complexes from the star G(e|E). It
can be shown that:

$$G^U(e|E) = \bigcap_{e_i \in E} (e \dashv e_i) \qquad (27)$$

To obtain the star G(e|E) from $G^U(e|E)$, the right-hand side of (27)
must be converted to the union of maximal (under inclusion) complexes.
Such a union is obtained when the set-theoretical multiplication is done
with the application of absorption laws.
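
As an illustration of equations (26) and (27), the following sketch builds a star by multiplying out the "extended against" unions and applying absorption after each multiplication. It is not the implementation from [14]; it assumes nominal variables, represents an event as a tuple of values and a complex as a tuple of frozensets (one reference set per variable, the full domain meaning "no selector"), and all function names are hypothetical.

```python
def extend_against(e, other, domains):
    """e -| other: the complexes [x_i != other[i]] that contain e."""
    full = tuple(frozenset(d) for d in domains)
    result = []
    for i, (a, b) in enumerate(zip(e, other)):
        if a != b:  # [x_i != b] contains e only when e differs from other on x_i
            result.append(full[:i] + (full[i] - {b},) + full[i + 1:])
    return result

def multiply_with_absorption(left, right):
    """Set-theoretic product of two unions of complexes, keeping maximal ones."""
    prods = {tuple(a & b for a, b in zip(c1, c2)) for c1 in left for c2 in right}
    return [c for c in prods
            if not any(c != d and all(x <= y for x, y in zip(c, d)) for d in prods)]

def star(e, others, domains):
    """Maximal complexes that cover e and none of the events in others."""
    result = [tuple(frozenset(d) for d in domains)]  # start from the whole space
    for other in others:
        result = multiply_with_absorption(result, extend_against(e, other, domains))
    return result

domains = [(0, 1), (0, 1, 2)]
for cpx in star((0, 0), [(1, 1)], domains):
    print([sorted(s) for s in cpx])
```

Applying absorption after every multiplication, rather than once at the end, keeps the intermediate lists small; the final result is the same set of maximal complexes.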

Procedure NID:

(A transformation of a non-disjoint cover into a disjoint cover)


Let $\{\alpha_1, \alpha_2, \ldots, \alpha_\ell\}$ be a set of not necessarily disjoint complexes,
which is a cover of a data event set F.

1.   Let $c(\alpha_i)$, $i = 1,2,\ldots,\ell$, denote the cardinality of $\alpha_i$ (the total
number of events covered). Determine the (arithmetic) sum of
cardinalities:

$$sc = \sum_{i=1}^{\ell} c(\alpha_i) \qquad (28)$$

and the cardinality of the (set-theoretic) sum of complexes:

$$cs = c\left(\bigcup_{i=1}^{\ell} \alpha_i\right) \qquad (29)$$

2.   If sc = cs then STOP: the given cover is already a disjoint cover.


3.  For $i = 1,2,\ldots,\ell$, determine the relative core, $CORE_i$, of complex
$\alpha_i$, i.e., the set containing data events covered by complex $\alpha_i$ and only
by this complex. Let RESIDUE denote the set of remaining events,
i.e., $RESIDUE = F \setminus \bigcup_{i=1}^{\ell} CORE_i$.

4.  For each $CORE_i$ determine its mc-complex (def. 6):

$$\alpha_i^0 = RU(CORE_i), \quad i = 1,2,\ldots,\ell \qquad (30)$$

5.  If any two complexes $\alpha_i^0$ intersect, then STOP. The disjoint cover
cannot be obtained. (This is a direct consequence of Theorem 1.)


6.  Select an event from RESIDUE and call it e. Delete e from RESIDUE.


7.  For each pair $(e, \alpha_i^0)$, $i = 1,2,\ldots,\ell$, determine the covering complex:

$$\alpha_i^* = RU(\{e\} \cup \alpha_i^0) \qquad (31)$$

8.  Delete every $\alpha_i^*$ which intersects with any $\alpha_j^0$, $j \neq i$. If all $\alpha_i^*$ are
deleted then STOP: a disjoint cover cannot be obtained.

9.  Select the best complex, Best-α, among complexes $\alpha_i^*$, according to the
LEF:

<(Δspars, -res, -Δsel), ($\tau_1$, $\tau_2$, $\tau_3$)>

where
Δspars  -  the difference between the sparseness of $\alpha_i^*$ and $\alpha_i^0$
res     -  the number of events in RESIDUE covered by $\alpha_i^*$
Δsel    -  the difference between the number of selectors in $\alpha_i^*$ and $\alpha_i^0$
$\tau_1$, $\tau_2$, $\tau_3$  -  tolerances, set to 0 by default.

The sign '-' in front of res and Δsel indicates that the algorithm
will maximize these criteria (by minimizing the negative value).

10.  Suppose Best-α was created by joining e with $\alpha_j^0$. Assign to $\alpha_j^0$ a new
value Best-α.

11.  If RESIDUE = $\emptyset$, then END, otherwise go to 6.


The output from this procedure is either a disjoint cover
$\{\alpha_1^0, \alpha_2^0, \ldots, \alpha_\ell^0\}$ of set F, or an indication that such a cover cannot be
obtained from the initial cover $\{\alpha_1, \alpha_2, \ldots, \alpha_\ell\}$.
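
Steps 3 to 11 of procedure NID can be sketched as follows. This is a simplified illustration, not the original program: it assumes nominal variables (so the refunion RU is just the per-variable set of observed values), zero tolerances (so the LEF reduces to lexicographic comparison of the three criteria), an arbitrary order of selecting events from RESIDUE, and it omits the shortcut of steps 1 and 2. Events are tuples, complexes are tuples of frozensets, and all names are hypothetical.

```python
from math import prod

def covers(cpx, e):
    return all(v in s for v, s in zip(e, cpx))

def intersects(c1, c2):
    return all(a & b for a, b in zip(c1, c2))

def refunion(events, n):
    """mc-complex of a set of events (nominal variables assumed)."""
    return tuple(frozenset(e[j] for e in events) for j in range(n))

def sparseness(cpx, F):
    """Events in the complex that are not data events."""
    return prod(len(s) for s in cpx) - sum(covers(cpx, e) for e in F)

def n_selectors(cpx, domains):
    return sum(s != frozenset(d) for s, d in zip(cpx, domains))

def nid(cover, F, domains):
    n, l = len(domains), len(cover)
    owners = {e: [i for i, c in enumerate(cover) if covers(c, e)] for e in F}
    residue = [e for e in F if len(owners[e]) != 1]              # step 3
    base = [refunion([e for e in F if owners[e] == [i]], n)      # step 4
            for i in range(l)]
    if any(intersects(base[i], base[j])                          # step 5
           for i in range(l) for j in range(i + 1, l)):
        return None
    while residue:                                               # step 11
        e = residue.pop()                                        # step 6
        candidates = []
        for i, b in enumerate(base):
            grown = tuple(s | {v} for s, v in zip(b, e))         # step 7
            if any(intersects(grown, base[j]) for j in range(l) if j != i):
                continue                                         # step 8
            key = (sparseness(grown, F) - sparseness(b, F),
                   -sum(covers(grown, r) for r in residue),
                   -(n_selectors(grown, domains) - n_selectors(b, domains)))
            candidates.append((key, i, grown))
        if not candidates:
            return None                # no disjoint cover can be obtained
        _, i, best = min(candidates, key=lambda t: t[0])         # step 9
        base[i] = best                                           # step 10
    return base

F = [(0, 0), (1, 0), (2, 0), (3, 0)]
cover = [(frozenset({0, 1, 2}), frozenset({0})), (frozenset({2, 3}), frozenset({0}))]
print(nid(cover, F, [range(4), range(2)]))
```

The function returns `None` where the procedure stops with "a disjoint cover cannot be obtained", and the list of complexes otherwise.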


6.   AN  ALGORITHM  FOR  CONJUNCTIVE  CONCEPTUAL  CLUSTERING 


6.1.   An  Overview 


Based on the ideas described in previous sections, we have developed
an algorithm for conjunctive conceptual clustering, called PAF.* Given a
set, E, of events from an arbitrary event space, and an integer k, PAF
partitions E into k clusters, each of which has a conjunctive description
in the form of a VL1 complex. The obtained partition is optimal or
suboptimal with regard to a lexicographic evaluation functional, assembled
by a user from the criteria listed in sec. 4.

*Polish-American-French

The  general  structure  of  the  algorithm  is  based  on  the  multicriteria 
dynamic  clustering  method  developed  by  Diday  and  his  collaborators  (Diday 
and  Simon  [1],  Hanani  [15]).  Underlying  notions  of  the  dynamic  clustering 
method  are  two  functions: 

g  -  the representation function, which, given k clusters of a partition
of E (a k-partition), produces a set of k cluster representations,
called kernels. There may be different kinds of kernels, e.g., the
center of gravity of a cluster, a few selected points from a cluster, a
probability distribution best fitting the cluster, a linear manifold of
minimal inertia, etc.

f  -  the allocation function, which, given a set of kernels,
partitions E into k clusters, "best fitting" these kernels.

The  method  works  iteratively,  starting  with  a  set  of  k  initial, 
randomly  chosen  kernels  (of  a  given  kind).  A  single  iteration  consists  of 
an  application  of  function  f  to  given  kernels,  and  then  of  function  g  to 
the  obtained  partition.  An  iteration  ends  with  a  new  set  of  kernels.  The 
process  continues  until  the  chosen  criterion  of  quality  of  a  partition,  W, 
ceases  to  improve.  (Criterion  W  measures  the  "fit"  between  a  partition  and 
kernels.) It has been proven [1] that this method always converges to a
local optimum.

The measure W can be a single criterion, or a sequence of criteria.
In the multicriteria case, for each criterion an appropriate type of
kernels is used (Hanani [15]).
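
The alternation of the allocation function f and the representation function g described above can be sketched generically. The example below is an assumption-laden illustration, not code from [1] or [15]: it uses one-dimensional numeric events, centers of gravity as kernels, the sum of squared distances as criterion W (lower is better), and assumes that no cluster becomes empty. All names are hypothetical.

```python
def dynamic_clustering(events, kernels, allocate, represent, quality, max_iter=100):
    """Alternate f (allocate) and g (represent) until W stops improving."""
    best_w, best = float("inf"), None
    for _ in range(max_iter):
        partition = allocate(events, kernels)   # function f
        kernels = represent(partition)          # function g
        w = quality(partition, kernels)         # criterion W
        if w >= best_w:
            break
        best_w, best = w, (partition, kernels)
    return best

def allocate(events, kernels):
    """Assign each event to the cluster of its nearest kernel."""
    clusters = [[] for _ in kernels]
    for x in events:
        nearest = min(range(len(kernels)), key=lambda i: abs(x - kernels[i]))
        clusters[nearest].append(x)
    return clusters

def represent(partition):
    """Kernel = center of gravity of each cluster (clusters assumed non-empty)."""
    return [sum(c) / len(c) for c in partition]

def quality(partition, kernels):
    """Sum of squared distances between events and their kernels."""
    return sum((x - m) ** 2 for c, m in zip(partition, kernels) for x in c)

partition, kernels = dynamic_clustering(
    [1, 2, 3, 10, 11, 12], [1.0, 12.0], allocate, represent, quality)
print(partition, kernels)
```

Passing f, g and W as arguments mirrors the text: other kernel types only require different `allocate`, `represent` and `quality` functions.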

The algorithm PAF applies a multicriteria dynamic clustering method,
in which the basic and final cluster representation is a VL1 complex.
Intermediate representations include the geometrical center of a cluster
(using the syntactic distance; def. 1) and the "most outstanding" event
(most distant from the center) in a cluster.

The use of the latter representation is an application of an
"adversity principle." This principle states that if the most outstanding
event truly belongs to the given cluster, then, when it serves as the
cluster representation, the "fit" between it and other events in the same
cluster should still be better than the "fit" between it and events of any
other cluster.

In the algorithm PAF, the measure of "fit" between a data event and a
kernel (a VL1 complex) is a binary measure, defined by a predicate
specifying whether an event satisfies the complex or not. A complex is a
form which can describe a very large number of configurations of events.
For n variables, each taking d distinct values, there are $N = (2^d - 1)^n$
different complexes. For example, if n = 10 and d = 7, then $N = 127^{10}$ is
approximately $10^{21}$. Such a large size of the "concept space" makes
conjunctive clustering computationally an extremely complex problem. To
obtain a feasible practical solution, it is necessary to apply a
combination of carefully designed heuristic search methods. In PAF, one of
the methods used is the well-known "best first" search technique developed
in artificial intelligence [16].

6.2  Description  of  PAF 

A  flow  diagram  of  algorithm  PAF  is  shown  in  Fig.  4. 

1.  In the first step (block 1), a set of k data events
$E_0 = \{e_1, e_2, \ldots, e_k\}$, called seeds, is selected from the event set E.
Seeds can be selected arbitrarily, or they can be chosen as events
which are most distant syntactically (def. 1) from each other. In the
latter case the algorithm will generally converge faster. For
selecting such events program ESEL [17] can be used.


2.  For each seed $e_i$, $i = 1,2,\ldots,k$, a star is generated against the
remaining seeds (using procedure STAR described in sec. 5):

$$G_i = G(e_i \mid E_0 \setminus \{e_i\}), \quad i = 1,2,\ldots,k$$

3.   From each star a complex is selected, such that the resulting set of k
complexes:

(i)   is a disjoint cover of E;

(ii)  is an optimal or suboptimal cover among all possible such covers,
according to an assumed criterion LEF (constructed by a user from
criteria listed in sec. 4: sparseness or generality, intersection,
imbalance and dimensionality).

This is the most difficult and computationally costly step of the
algorithm. It can be performed in a number of different ways. We will
distinguish between three different procedures: P (parallel), PS
(parallel-sequential) and S (sequential). These procedures are described
in the next section.

[Figure 4. A flow diagram of algorithm PAF. Given: a set of data events,
the desired number of clusters, and the evaluation functional. Blocks shown:
using procedure STAR, determine the star of each seed against the remaining
seeds; select from each star one complex, so that the obtained collection,
P, of k complexes will be the "best" disjoint cover of E (with the help of
the NID procedure); test whether the termination criterion applied to P is
satisfied; test whether the iteration is odd or even; choose k new seed
events which are central in the complexes in P; choose k new seed events
which are extreme in the complexes in P.]

4.  A termination criterion of the algorithm is applied to the obtained
cover. The termination criterion is a pair of parameters (b,p), where
b (the base) is a standard number of iterations the algorithm always
performs, and p (the probe) is the number of iterations beyond b
which the algorithm performs after each iteration which produces an
improved cover.

5.  A new set of seeds is determined. If the iteration is odd, then the
new seeds are data events in the centers of complexes in the cover
(according to the syntactic distance). If the iteration is even, then
the new seeds are data events maximally distant from the centers
(according to the "adversity principle").
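
The outer loop formed by steps 4 and 5 can be sketched as below. The reading of the (b,p) criterion used here is an assumption: the loop always runs at least b iterations and stops once p consecutive iterations have passed without an improved cover. The function `iterate` is a hypothetical stand-in for steps 2 and 3 (it receives the seed-selection mode and returns the score of the cover found, lower being better).

```python
def paf_outer_loop(iterate, base, probe):
    """Run iterations until the (base, probe) termination criterion is met."""
    best = float("inf")
    last_improved = 0
    iteration = 0
    seed_mode = "initial"          # seeds of the first iteration (step 1)
    history = []
    while True:
        iteration += 1
        score = iterate(seed_mode)
        history.append((iteration, seed_mode, score))
        if score < best:
            best, last_improved = score, iteration
        # step 4: assumed interpretation of the termination criterion (b, p)
        if iteration >= base and iteration - last_improved >= probe:
            break
        # step 5: odd iteration -> central seeds, even iteration -> extreme seeds
        seed_mode = "central" if iteration % 2 == 1 else "extreme"
    return best, history

# Usage with a fixed sequence of scores standing in for real covers.
scores = iter([5, 4, 4, 4, 3, 3])
best, history = paf_outer_loop(lambda mode: next(scores), base=2, probe=2)
print(best, len(history))
```

With this reading, a larger probe trades running time for a lower chance of stopping just before a further improvement.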

7.   PROCEDURES P, PS AND S

All three procedures use bounded stars, that is, stars whose size is
limited by a special parameter MAXSTAR. The reason is that the size of stars
may be very large when the number of variables n is high. As can be seen
from procedure STAR, the upper bound on the number of complexes in a star
grows exponentially with k (the number of clusters); namely, it is $n^{k-1}$,
since each of the k-1 remaining seeds contributes at most n complexes to
the product in (27). The size of any star is controlled by not allowing it
to have more than MAXSTAR complexes. Whenever a star exceeds this number,
complexes are ordered in the order of ascending sparseness, and only the
first MAXSTAR complexes are retained. It is also assumed that all complexes
in stars are trimmed (i.e., the refunion operation is applied to the core
of each complex, and then the resulting mc-complex is used to replace the
original complex in the star; see def. 7).

To  simplify  the  description  of  procedures  we  will  assume  that  the 
criterion  of  clustering  optimality  is  minimizing  the  sparseness  of  the 
disjoint  cover  (representing  a  partition).  The  procedures  can  be  extended 
for  a  multicriteria  case  by  using  a  criterion  LEF  (which  imposes  a  linear 
order between equivalence classes of sets of complexes). In such a
multicriteria  case,  however,  sparseness  should  be  used  as  the  primary 
criterion  in  order  to  retain  the  properties  of  the  described  procedures. 
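
The bounding and trimming of stars described at the beginning of this section can be illustrated as follows. This is a sketch under assumptions not taken from the paper: nominal variables, events as tuples, complexes as tuples of frozensets, sparseness computed as the number of events in a complex that are not data events, and hypothetical function names.

```python
from math import prod

def covers(cpx, e):
    return all(v in s for v, s in zip(e, cpx))

def sparseness(cpx, events):
    """Events in the complex that are not data events."""
    return prod(len(s) for s in cpx) - sum(covers(cpx, e) for e in events)

def trim(cpx, events):
    """Replace a complex by the mc-complex (refunion) of its core."""
    core = [e for e in events if covers(cpx, e)]
    return tuple(frozenset(e[j] for e in core) for j in range(len(cpx)))

def bound_star(star, events, maxstar):
    """Keep only the MAXSTAR complexes of smallest sparseness."""
    return sorted(star, key=lambda c: sparseness(c, events))[:maxstar]

events = [(0, 0), (1, 0)]
star = [(frozenset({0, 1, 2}), frozenset({0, 1})),
        (frozenset({0, 1}), frozenset({0}))]
print(bound_star(star, events, maxstar=1))
print(trim(star[0], events))
```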

Procedure P

The procedure is applicable for relatively small MAXSTAR and k. It is
particularly useful for execution on a parallel processor. Let star
$G_i = G(e_i \mid E_0 \setminus \{e_i\})$ be a set $\{\alpha_0^i, \alpha_1^i, \ldots, \alpha_{g_i}^i\}$, $i = 1,2,\ldots,k$. Assume that
complexes $\alpha_j^i$, $j = 0,1,\ldots,g_i$, are ordered in ascending order on sparseness.
The position of a complex in the star so ordered (indicated by a subscript,
which counts from 0) is called the rank of the complex (thus, e.g.,
complex $\alpha_2^i$ has rank 2).

Taking one complex $\alpha_j^i$ from each star $G_i$, $i = 1,2,\ldots,k$, at a time,
generate all possible sequences:

$$\begin{aligned}
P_0 &= (\alpha_0^1, \alpha_0^2, \ldots, \alpha_0^k) \\
P_1 &= (\alpha_0^1, \alpha_0^2, \ldots, \alpha_0^{k-1}, \alpha_1^k) \\
&\;\;\vdots \\
P_r &= (\alpha_{g_1}^1, \alpha_{g_2}^2, \ldots, \alpha_{g_k}^k)
\end{aligned} \qquad (32)$$

where $r = (g_1+1)(g_2+1)\cdots(g_k+1)$.

The sum of the ranks of complexes in any such sequence is called the
pathrank. Assume that sequences $P_j$, $j = 1,2,\ldots,r$, are now arranged in
ascending order on their pathrank, with sequences of equal pathrank ordered
arbitrarily. As before, $P_0$ has pathrank 0 (because all complexes in $P_0$
have rank 0). $P_1, P_2, \ldots, P_k$, however, denote sequences with pathrank 1, and
$P_r$ denotes a sequence with pathrank $g_1 + g_2 + \cdots + g_k$.
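
The ordering of sequences by pathrank can be illustrated with a short sketch. It only enumerates rank tuples, one rank per star; the function name and the representation of a sequence as a tuple of ranks are assumptions made for the illustration.

```python
from itertools import product

def sequences_by_pathrank(star_sizes):
    """All rank tuples (one rank per star), sorted by ascending pathrank.

    star_sizes[i] is the number of complexes in star i, i.e. g_i + 1.
    """
    all_ranks = product(*(range(size) for size in star_sizes))
    return sorted(all_ranks, key=sum)  # pathrank = sum of ranks

for ranks in sequences_by_pathrank([2, 3]):
    print(sum(ranks), ranks)
```

For two stars with 2 and 3 complexes this lists 6 sequences, starting with the all-rank-0 sequence and ending with the sequence of highest ranks; Python's stable sort keeps ties in their generated order, which is one admissible "arbitrary" ordering.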

Considering sequences $P_j$ in the ascending order on their pathrank,
the following operations are performed on each sequence:


(i)   $P_j$ is tested to determine whether it is a cover of E. This can be
done by consecutively removing from E data events covered by each complex
in $P_j$. If at the end E becomes the empty set, $P_j$ is a cover. If $P_j$ is
not a cover, it is removed from further consideration.


(ii)  $P_j$ is tested to determine whether it is a disjoint cover. If it is,
its sparseness is calculated. If it is not, a lower bound (l.b.) on the
sparseness of a possible disjoint cover is calculated (without
actually determining the disjoint cover).

The l.b. is computed by determining the relative core of each
complex (i.e., data events covered only by the given complex and not
by any other complexes), and then computing the sparseness of the mc-
complex of the core. The l.b. is the sum of the sparsenesses so obtained
(this computation is based on theorem 3). {The purpose of using the
l.b. is to avoid, whenever possible, the computationally costly
procedure NID.}

(iii)  If the computed sparseness (or l.b.) is not a new minimum (i.e., is
not smaller than the sparseness of the best cover obtained so far),
then the cover is removed from further consideration. Otherwise, if
it is a disjoint cover, it is retained as the best cover; and if it is
a non-disjoint cover, it is transformed by NID, if possible, into a
disjoint cover (note that some operations of the NID procedure were
already done in (ii)). If the sparseness of the obtained disjoint
cover still represents a new minimum, the cover is retained as the
best so far. If the sparseness is not a new minimum, or NID fails to
produce a disjoint cover, the cover is removed from further
consideration.
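
The cover test of (i) and the lower bound of (ii) can be sketched as follows. As before, this is an illustration under assumptions: nominal variables, events as tuples, complexes as tuples of frozensets, and hypothetical function names; the cover test checks each event directly rather than removing covered events from E, which gives the same answer.

```python
from math import prod

def covers(cpx, e):
    return all(v in s for v, s in zip(e, cpx))

def sparseness(cpx, events):
    """Events in the complex that are not data events."""
    return prod(len(s) for s in cpx) - sum(covers(cpx, e) for e in events)

def is_cover(seq, events):
    """(i): every data event is covered by at least one complex."""
    return all(any(covers(c, e) for c in seq) for e in events)

def lower_bound(seq, events):
    """(ii): sum of sparsenesses of the mc-complexes of the relative cores."""
    n = len(events[0])
    total = 0
    for i, c in enumerate(seq):
        core = [e for e in events
                if covers(c, e)
                and not any(covers(d, e) for j, d in enumerate(seq) if j != i)]
        mc = tuple(frozenset(e[j] for e in core) for j in range(n))
        total += sparseness(mc, events)
    return total

events = [(0, 0), (1, 1), (2, 0)]
seq = [(frozenset({0, 1}), frozenset({0, 1})), (frozenset({2}), frozenset({0}))]
print(is_cover(seq, events), lower_bound(seq, events))
```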

The disjoint cover retained at the end of the above search process
through sequences $P_j$ is the output of the procedure. It is a minimum
sparseness cover which can be assembled from complexes in the given stars.
The existence of at least one disjoint cover is assured by the sufficiency
principle. An advantage of the above described ordering of sequences $P_j$
is that the best cover will most likely be close to the beginning of the
list. Therefore, if the number of sequences is very large, the search can
stop before reaching the end, with a low risk of losing the optimal
solution.


Procedure  PS 

In procedure P, all sequences $P_j$ were generated first, and then
linearly searched in order to determine the best cover. In this procedure,
the search for the best cover is done during the process of generating the
sequences, using the "best first" search strategy (Winston [16]).
Specifically, the search is based on the algorithm A* (Nilsson [18]). At
each step, a complex is added to the partial cover (a partial sequence after
application of NID) which most likely leads to the optimal cover (according
to an evaluation function). This process avoids testing (usually many)
sequences $P_j$ for which it is possible to predict that they will not
produce an optimal cover. The procedure PS is especially applicable when
stars $G_i$ are large.

Fig. 5 illustrates the search process. Branches emanating from a node
at level i represent complexes in star $G_{i+1}$. A path from the root to a node
at level i represents a partial disjoint cover with i complexes. When i=k,
the path represents a complete disjoint cover (corresponding to some
sequence $P_j$ to which NID was applied).

In the first step, sequence $P_0 = (\alpha_0^1, \alpha_0^2, \ldots, \alpha_0^k)$ is generated. (It is
the sequence of complexes of the smallest sparseness.) The relative core
of each complex is determined and then the mc-complex is constructed for
each core. Let $s_1, s_2, \ldots, s_k$ denote the sparsenesses of the obtained mc-
complexes. On the basis of theorem 3, the sum $s_1 + s_2 + \cdots + s_k$ specifies
a lower bound on the sparseness of the best disjoint cover which can be
built from complexes of given stars.

In the next step, node (1) (fig. 5) is expanded, i.e., $\alpha_0^1$ is paired
with every complex in $G_2$, procedure NID is applied to each pair, and then
the sparseness is calculated for the obtained disjoint pair. If NID fails,
the path is abandoned. The obtained pair is a partial cover with i=2
complexes. Nodes corresponding to generated partial covers (including the
remaining complexes in $G_1$) are assigned a value of the evaluation function:

$$f = h + g \qquad (33)$$

where

h  -  is the sparseness of the obtained partial disjoint cover

g  -  is the sum $s_{i+1} + s_{i+2} + \cdots + s_k$, where i is the number of complexes
in the partial cover.

{g represents a l.b. on the sparseness of the remaining complexes to
be determined, i.e., complexes that are needed to complete the cover
under construction}

[Figure 5. A search tree for an optimal cover. The tree is drawn by levels
(level 1, level 2, level 3, ..., level k); the optimal cover is marked at
level k.]

According  to  the  best  first  strategy,  the  node  to  be  expanded  at  each 
step  is  the  one  which  is  associated  with  the  lowest  value  of  the 
evaluation  function.  It  is  proven  that  such  strategy  will  produce  the 
optimal  cover  [18].  The  order  of  expanding  nodes  in  the  tree  in  Fig.  5  is 
shown  by  numbers  in  circles.  The  value  of  the  evaluation  function 
associated  with  each  node  is  given  in  parentheses. 
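
The best-first strategy with the evaluation function f = h + g can be sketched generically. The sketch below is not the PS procedure itself: the callables `cost` (sparseness of a partial disjoint cover) and `compatible` (a stand-in for "NID succeeds on this partial sequence"), the list `lower_bounds` (the values $s_1, \ldots, s_k$), and the use of plain numbers in place of complexes are all assumptions made to keep the example self-contained.

```python
import heapq
from itertools import count

def best_first_cover(stars, cost, compatible, lower_bounds):
    """Pick one candidate per star, expanding the node with the lowest f = h + g."""
    k = len(stars)
    tie = count()  # tie-breaker so the heap never compares paths directly
    heap = [(sum(lower_bounds), next(tie), ())]
    while heap:
        f, _, path = heapq.heappop(heap)
        if len(path) == k:
            return path, f                      # complete cover with the lowest f
        for candidate in stars[len(path)]:
            new_path = path + (candidate,)
            if not compatible(new_path):
                continue                        # path abandoned (NID would fail)
            h = cost(new_path)                  # sparseness of the partial cover
            g = sum(lower_bounds[len(new_path):])  # l.b. for the remaining stars
            heapq.heappush(heap, (h + g, next(tie), new_path))
    return None

# Toy usage: candidates are their own sparseness; one combination is forbidden.
stars = [[3, 5], [2, 3], [1, 6]]
forbidden = lambda p: len(p) >= 2 and p[0] == 3 and p[1] == 2
print(best_first_cover(stars, sum, lambda p: not forbidden(p), [3, 2, 1]))
```

Because g never overestimates the cost of completing a cover in this toy setting, the first complete path taken from the heap is a minimum-cost one, which is the property the text attributes to [18].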

Procedure S

This  procedure  is  like  procedure  PS,  with  the  exception  that  stars  are 
not  generated  beforehand.  When  expanding  a  node  in  the  search  tree,  rather 
than  taking  complexes  from  already  determined  stars,  an  appropriate  star  is 
generated  each  time.  This  requires  a  multiple  repetition  of  the  star 
generation  process,  but  saves  on  the  memory  for  storing  all  stars  (which 
may be large sets).

8.   A NOTE ON IMPLEMENTATION AND AN EXAMPLE

The algorithm has been implemented by R. Stepp in PASCAL for Cyber
175. The details on the implementation are in [19]. For illustration we
will briefly describe two examples, which were used in testing experiments
with the program.

Figure 6a represents a diagrammatic representation [20] of an event
space, spanned over variables $x_1, x_2, x_3, x_4$, with domain sizes 2, 5, 4, 2,
respectively. Each cell represents one event. Cells marked by 1 represent
data events, remaining cells represent empty events. Fig. 6a also shows a
cover obtained from the first iteration of the algorithm. The remaining
figures show results from the consecutive iterations. Cells representing
seed events in each iteration are marked by +. The partition evaluation
criterion was a LEF:

<(sparseness, imbalance, dimensionality), (0, 0, 0)>

According to this criterion, the best partition is the one shown in
Fig. 6c. The partition is specified by complexes:

$\alpha_1^0 = [x_1 = 0][x_2 = 1][x_4 = 0]$

$\alpha_2^0 = [x_1 = 0][x_2 = 2][x_3 = 1..3]$

$\alpha_3^0 = [x_1 = 1][x_2 = 1..3]$

Another experiment with the program involved clustering 47 cases of
soybean diseases. These cases represented four different diseases, as
determined by plant pathologists (the program was not, of course, given
this information). Each case was represented by an event of 35 many-valued
variables. With k=4, the program partitioned all cases into four
categories. These four categories turned out to be precisely the
categories corresponding to individual diseases. The complexes defining
the categories involved known characteristic symptoms of the corresponding
diseases.

[Figure 6. Covers obtained in consecutive iterations of the algorithm:
Iteration 1:  Sparseness = 18,  Imbalance = 1.6,  Dimensionality = 3
Iteration 2:  Sparseness = 20,  Imbalance = 3.6,  Dimensionality = 3
Iteration 3 (panel c, optimal solution):  Sparseness = 12,  Imbalance = 7.3,  Dimensionality = 4
Iteration 4:  Sparseness = 16,  Imbalance = 3.6,  Dimensionality = 3]

9.   CONCLUSION

The  paper  presented  a  theoretical  foundation  and  an  algorithm  for 
conceptual  clustering,  in  which  entities  are  assembled  into  classes 
described by single conjunctive concepts (VL1 complexes). Thus, the
proposed  approach  produces  clusters  together  with  their  descriptions.  The 
descriptions  are  conjunctive  statements  involving  relations  on  variables 
characterizing  the  entities,  and  have  a  simple  linguistic  interpretation. 

The presented algorithm has been implemented and tested on various
examples. The results indicate that the method provides a valuable
alternative to the conventional clustering methods, and has a potential for
application in a variety of clustering problems.